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Abstract 

We examine the problem of family size statistics (the number of individuals carrying the same surname, or 
the same DNA sequence) in a given size subsample of an exponentially growing population. We approach the 
problem from two directions. In the first, we construct the family size distribution for the subsample from 
the stable distribution for the full population. This latter distribution is calculated for an arbitrary growth 
process in the limit of slow growth, and is seen to depend only on the average and variance of the number 
of children per individual, as well as the mutation rate. The distribution for the subsample is shifted left 
with respect to the original distribution, tending to eliminate the part of the original distribution reflecting 
the small families, and thus increasing the mean family size. From the subsample distribution, various bulk 
quantities such as the average family size and the percentage of singleton families are calculated. In the 
second approach, we study the past time development of these bulk quantities, deriving the statistics of the 
genealogical tree of the subsample. This approach reproduces that of the first when the current statistics of 
the subsample is considered. The surname distribution from th e 2000 U.S. Census is examined in light of 
these findings, and found to misrepresent the population growth rate by a factor of 1000. 
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1. Introduction 

There is a long history of work in the social 
sciences on family size distributions. The classic 
founding work in this field is that of Galton and 
Watson (GW) [3] who tried to explain the decline 
of the British great families, as indicated by data on 
surname abundance. Rejecting previous explana- 
tions based on "fitness", e.g., that the rise of phys- 
ical comfort is followed by fertility decline, they as- 
sumed that the phenomenon is purely statistical. 
The affiliation of an individual with certain family, 
expressed in his/her surname, was assumed by GW 
to be a neutral property. This feature is inherited 
to the next generation according to a well defined 
rule (all offsprings take the surname of their father) 
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and is subject to the stochasticity that character- 
izes birth-death processes. Assuming a well-mixed 
population, GW claimed that all surnames undergo 
extinction in the long run. In fact, their conclu- 
sions were correct only for an equilibrium popula- 
tion, whereas for a growing population, their equa- 
tions exhibit a second nontrivial solution which was 
found by Steffenson [T31 E] and exploited by Lotka 
|Hl|9j using U.S. census data to deduce the offspring 
distribution. Subsequently the impact of surname 
changes ( "mutations" ) was considered by Manrubia 
and Zanette (MZ) (TU]. All in all, the surname in 
a society undergoes a birth-death-mutation process, 
and the current surname abundance distribution re- 
flects the demographic (birth-death ratio) and so- 
cial (mutation rate) characteristics of the popula- 
tion. MZ also presented data for the distribution 
of surname in the populations in Argentina, Berlin, 
and five cities in Japan, where the statistics were 
obtained from phonebooks. The data exhibited the 
predicted l/n^ behavior at large n for the prob- 
ability of n appearances of a surname. MZ then 
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attempted to use the deviations for smaller n to 
deduce the growth rate of the population. 

As already pointed out in Manrubia et al. [11], 
the importance of the clan statistics for a popu- 
lation that undergoes a birth-death-mutation pro- 
cess goes far beyond its applicability to surname 
dynamics. Any neutral genetic feature associated 
with a sequence that appears on certain loci and 
is subject to mutations undergoes exactly the same 
process, thus the results for surnames reflect also 
the amount of genetic polymorphism in the popu- 
lation. Another neutral process of the same kind 
was suggested by Hubbell [3 |6l |2] and Bell [2] as 
the underlying mechanism that yields the observed 
species abundance distribution. This heretical idea 
opposes the traditional "niche" theories that seeks 
to explain species abundance ratios in terms of in- 
terspecies interaction and fitness, and ignited an 
enormous contentious debate on that subject [12]. 
The argumentation of both sides is based on the 
species abundance statistics, as gathered in large- 
scale censuses [3] and the very same problem arises: 
what statistics should one expects in case of a grow- 
ing or shrinking population which is subject to neu- 
tral mutation? 

Before trying to compare the observed statistics 
with some theoretical predictions (e.g., in order 
to recover demographic parameters from the abun- 
dance ratio) one should address two crucial issues. 
The first is universality: to what extent should one 
expect the results to be independent of the "micro- 
scopic" features of the process? Fig. 1 shows the 
family size statistics (Pareto plot) obtained from 
numerical simulations of two populations with the 
same demographic characteristics. The dynamics 
assumes nonoverlapping generations, where the av- 
erage number of offspring per individual is A > 1 
and the chance of an offspring to mutate (i.e., to 
differ from its originator and to start a new clan) 
is II. Both populations have the same values for 
A and /i, but they differ microscopically: in one 
case the chance for an individual to produce n off- 
springs obeys the Poisson distribution with aver- 
age A, in the other case it satisfies the geometrical 
distribution with the same average. As one can 
clearly see, the tails of these distribution coincides, 
but the bulk abundance statistic is different; this 
implies that the theory of abundance ratio has no 
use for any practical purpose unless one knows the 
very fine details of the demographic properties of 
the population throughout history, an inconceiv- 
able task in almost all circumstances. A comparison 




Family Size 

Figure 1: Family size distribution for a population of 4- 10^, 
for the parameters A = 1.005, fi = 5 ■ 10~^, for a Poisson 
and for a geometrical distribution of offspring. The data was 
averaged over 100 runs and binned into bins which contained 
a minimum of 1000 families over the 100 runs. One sees that 
the large families are distributed in both cases as a power 
law, as in the MZ model. The power-law cuts off at small 
family sizes, below sizes of roughly 1000, at which point the 
two distributions diverge. Inset: A blowup of the figure for 
small family sizes, highlighting the difference between the 
two distributions. 

between experimental data and theoretical predic- 
tions is possible if, and only if, one can show that 
there is a universal regime in which the statistics is 
independent of the microscopic details; this is one 
of the aims of this paper. 

The second issue that should be addressed is the 
effect of sampling. In all cases considered above 
- surnames, genetic polymorphism, species abun- 
dance - the raw data is made of individuals sam- 
pled from the whole population together with their 
affiliation with certain surname or certain species. 
It is difficult to perform a complete census, given 
that typically one does not have access to the en- 
tire population. Thus for example, MZ only looked 
at the surnames beginning with "A" in the Berlin 
phone book. Even for the US census data, one has 
access to (almost) the entire population only un- 
der the assumption that the US is demographically 
isolated, which it is clearly not. In the application 
we have in mind, that of looking at genomic data 
to measure historic growth rates of the population, 
one has such data for an extremely small sample of 
the entire human population. 

What is the effect of incomplete sampling? In 
Fig. |2] one can see the characteristic features of 
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Figure 2: Avg. number of families of a given size, for the full 
population of 4 ■ lO'', and subsamples of size 2 ■ 10^5, 2 ■ 10'', 
2 ■ 10^ and 200. The growth rate is 7 = 0.005, and the 
mutation rate is /i = 5 ■ 10~^, and the child distribution is 
Poisson. The circles represent averages over 100 iterations. 
The lines are the theory for the full population, Eq. ^ and 
for a "red" subsample, Eq. (jsjl. The deviations from the 
power law for the largest m's seen in the No = 200 and 2000 
data are due to the fact that the largest family does not 
obey the stable distribution, but rather reflects the single 
individual initial conditions chosen [lOJ. 

the family statistics obtained in the two regimes: 
strong and weak samphng. One can see that the 
full statistics is characterized by a "shoulder" in 
the small families region, followed by a power-law 
decay for large families. If the sampling is strong 
the distribution is shifted quite rigidly to the left, 
while the case of weak sampling is characterized by 
a peak for the singletons (families with only one 
member) followed by a power-law. Our second aim 
here is to clarify effect in both regimes. 

In the following, we analyze the problem from 
two different angles. The first is centered on the 
stable distribution for the entire population. This 
distribution can be calculated from a Fokker-Planck 
equation, akin to that written down by MZ. We 
show this Fokker-Planck equation is in fact univer- 
sally valid in the limit of small growth rate, for an 
arbitrary distribution of children produced by an 
individual, with the coefficients depending only on 
the average and variance of the children distribu- 
tion, together with the mutation rate. From this 
we can calculate the distribution for a given sized 
subsample of the population in terms of a hypergeo- 
metric function. We then endeavor to assimilate the 
meaning of this result, focussing on the strong and 



weak sampling limits, exhibiting simpler formulae 
for the average family size and number of singleton 
families We also show that the large-family power- 
law asymptotics is left unchanged by the sampling. 

The second approach is based on looking at the 
behavior of the genealogical tree of the selected 
sample. We calculate the size of the tree as a func- 
tion of time, as well as the number of families and 
singletons, all in the limit of small mutation rate. 
These results are seen to reproduce those of the 
previous approach for the current statistics of the 
selected sample. 

The plan of the paper is as follows. In Section [2] 
we describe our model, explain the notation used 
along this work and highlight our main results. In 
Section III, we present our derivation of the fam- 
ily size distribution for the subsample, and calcu- 
late the average family size and number of singleton 
families. In Section IV, we present our second ap- 
proach. Finally, in Section V we examine the sur- 
name distribution taken from the U.S. census data 
in light of our findings. We then summarize and 
present some concluding remarks. 

2. Model, simulation technique and main 
results 

Our basic model is that of a growing population 
with nonoverlapping generations, as in the original 
Galton- Watson work. Every member of the popula- 
tion simultaneously gives birth to a random number 
of children, drawn from a given distribution with 
mean A, and is then removed. The children are 
all reckoned to belong to the same family as the 
parent, unless they undergo a mutation at birth, 
which occurs with probability /i. The mutated child 
is considered to start a new family. We start the 
population with one individual, repeating the ex- 
periment until the population survives the initial 
stages and achieves the desired size. In principle, 
we could track the genealogy of every individual. In 
practice, for efficiency's sake, we track only the ge- 
nealogy of families, which is sufficient to determine 
the family identification of every individual. Thus, 
it is sufficient to draw the number of children of 
each family. In our simulations, we mostly employ 
a Poisson distribution for the number of children, 
occasionally comparing to the case of a geometri- 
cal distribution. In the former case, the distribu- 
tion of children of a given family of size n is again 
Poisson, with mean A, whereas for the geometrical 



distribution, it is a Pascal (or negative binomial) 
distribution. 

As a technical point, we will be interested in Sec- 
tion V in the genealogical tree of subsamples of the 
population, so that we can track the time develop- 
ment of the statistics of this tree. We can do this 
retrospectively for the Poisson case, simply picking 
ancestors for each individual among the set of in- 
dividuals in the family in the previous generation 
which gave rise to this individual (which is the same 
family, barring mutations). This is done to avoid 
having to store the genealogies of individuals, which 
are of course more voluminous than those of fami- 
lies. 

Glossary: The growth rate 7 = A — 1 reflects the 
deviation of the process from demographic equilib- 
rium. In general, as discussed above, the distribu- 
tion function depends on the details of P(m), the 
chance of an individual to have m children in the 
next generation. It turns out, however, that in the 
universal regime the family statistics depends only 
on three parameters: 7 (or equivalently A), which 
reflects the average number of offspring per indi- 
vidual, (7, the standard deviation of the offspring 
distribution defined as 



rn?P{m) - A^ = Var(m), 



(1) 



and /X, the mutation rate. For convenience, we de- 
fine 

(2) 

as this combination appears often. 

The number of families with m members is de- 
fined as Tim- These definition implies that the sum 
of mn,„ over all m's yields the overall current size 
of the population. No- Except for m's of order 
unity, one may consider the size of the family as 
a continuous variable, thus replacing to by a; and 
Um by n{x). When the sampling is incomplete we 
tag the sampled individuals as "red", defining 
as the abundance of families with m individuals in 
the red (sampled) population [When dealing with 
the whole genealogy we define a " red" subgenealogy 
consisting of all the individuals that have at least 
one descendent in the sampled population]. Sam- 
pling introduces a new parameter to the problem, 
Ro, the number of sampled ("red") individuals. It 
turns out that there is a "critical" sample size which 
distinguishes between weak and strong sampling: 



Rr 



(3) 



We then measure sampling strength, s, through 
_ Ro Ro<T^il + iy) 



Rc 2No-/ 

Our main results are: 

1. In the large to limit n„ 
law, rim ^ m^^ where 



(4) 



decays like a power- 



lnA^(l-/i) 
lnA(l -^) 



(5) 



This law is semi-universal, in that it is in- 
dependent of the details of P{m). It how- 
ever depends on the assumption of nonover- 
lapping generations, and therefore differs from 
the power law found by MZ. It does however 
reduce to the MZ result, (3 ~ 2 + i^, in the slow 
growth, small mutation limit 7 /i ^ 1. 
The whole distribution (except for the very 
smallest to's) becomes universal if both /j, and 7 
are small. In that case n{x) satisfy the Fokker- 
Planck equation: 



dn 



<1- 



M)^(xn) 



2 dx- 



{xn). (6) 



A similar equation has been obtained by MZ 
for their particular model; here we show that 
it is a universal limit of the process for small 
rates, and also reveal its dependence on a. 
Solving for the steady state distribution of ([6]) , 
the abundance distribution function is: 



n{x) 



vR, 



r (2 + iy) [/ 1 + z/, 0, 
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a2(l 



^) 
(7) 

where U is the Kummer function Thus, 
the abundance distribution for different micro- 
scopic processes with the same 7 and ^ are 
related by a rescaling of the family size to and 
the abundance n, na^ being a universal func- 
tion of to/ct^ (since Rc oc l/cr^). We see this 
in Fig. |3] 

For the sampled ("red") population, is 
given by the monstrous expression: 

« vRcQ (2 + V, to) s" 2Fi{m,m + l]m + 2 + v;l - s) 



where B(a,5) = T{a)T{b) lT{a + h) is the Beta 
function and 2F1 is the hypergeometric func- 
tion [1], and s is the sampling strength intro- 
duced above. To digest this, we focus on two 
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Figure 3: Scaled average number of families ncr* as a func- 
tion of scaled family size m/a^ for the Poisson and geometric 
offspring distributions. The data is the same as in Fig. [T] 
Also shown is the analytic prediction, Eq. 



limits, that of strong and weak sampling. In 
the strong sampling limit, jNo <^ Ro No, 
so that s ^ 1, 



r(2 



-U{l + v,0,m/s) (9) 



vR, 



Thus, since s is proportional to i?o, when vary- 
ing Ro, Ron^ is a function only oi m/Ro, and 
the dependence of Ro just amounts to rescal- 
ings of m and n^. This implies that in this 
limit the breakdown of the asymptotic power- 
law occurs at m's of order Nq^/Ro, and in 
general sampling induces a rigid displacement 
of the family size distribution to the left and 
downward. For strong but partial sampling the 
formula applies all the way down to m = 1, 
whereas for the full population, the formula 
breaks down for the smallest m's. For small 
argument, U approaches a constant, and so for 
m ^ s, we get 



lyRr 



(10) 



For large arguments, we recover the standard 
power-law. This is evidenced in Fig. |4] 
For weak samphng, Ro <C 7-/V0, i.e., s ^ 1, 
the samphng strength decouples from the dis- 
tribution for m > 1 except to set the overall 
normalization: 



B{2 + iy,m-l-iy)i^Ros'' 



m > 1 
(11) 



Figure 4: Data collapse for strong sampling: Ro 
function of m/Ro for various sized samples, Ro = 6.4 ■ 10^, 
3.2-105, 1.6-10'' and 8-10*, along with the whole population 
R^ = No = i-W^. Also shown are the small and large s 
power law predictions. 



This is demonstrated in Fig. [Sj The distri- 
bution rapidly approaches the expected power- 
law behavior from above as m increases. Thus, 
the shoulder has disappeared completely. The 
families of family size 1, which we denote sin- 
gletons, are exceptional for weak sampling, 
since the chance of sampling more than one 
member from a given family vanishes in the 
small s limit, except for small (scaled) muta- 
tion rate v, where there are anomalously large 
numbers of large families. 
The average red family size, m^, is given by 
the equally monstrous formula 

s{l + v) 



2i^i(l,l;3 + i/; 1)-(1 



s)2i^i(l,l;3 
— (12) 

This is shown in Fig. |6| where is shown 
as a function of Ro, together with the results 
of numerical simulations. For strong sampling, 
there of order In s red families, and 

— B '"^ 



... (13) 
vmas 

where a is a, v dependent constant which ap- 



proaches unity for small v, given by Eq. (38 1. 
In particular, in the full sample, s = a'^{l + 
I/) 727, and the average family size is large, of 
order —1/(7;/ In 7) 

For weak enough sampling, the average fam- 
ily size approaches unity, since all families are 
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Figure 5: Data collapse for weak sampling: n^/ijj^"^ as a 
function of m for various sized samples, Ro = 2 00, 2 000 and 
2 ■ 10^, along with the analytic prediction, Eq. 
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6: Average family size as a function of the sub- 
size Ro for 7 = 0.005, = 5 ■ 10"'', and A^o = 4 • 10®. 



singletons. For small v however, this again oc- 
curs only for extremely small samples, and in 
practice, 

^c^.-l^ (14) 



v\tis 

so that for small v the average red family size 
is large, and decreases logarithmically as the 
sampling strength decreases. 
6. The distinction between weak and strong sam- 
pling is also reflected in the statistics of the 
genealogical tree of the subsample. For strong 
sampling, this is a strong coalescence at first 
as we must upward in the tree. For weak 
sampling, on the other hand, the coalescence 
is very small initially. Eventually, both trees 
narrow exponentially. For critical sampling, 
the narrowing of the tree is exponential for all 
times. 

This then is the general outline of our results. In 
the next section, we turn to a detailed derivation of 
these findings. 



3. General analysis 
mutation process 



of the branching- 



We start with the family size distribution for the 
entire population. This is given by the solution to 
a Fokker-Planck equation generalizing that derived 
by MZ for their model of overlapping generations. 
We start with the set of difference equations for the 
evolution of the whole-population distribution: 



(15a) 



J2 niP{i ^ P) i^^j fi^-'il - /i) + f^N{t + 1) 

(15b) 



The first equation represents the contribution of a 
family of size £ giving birth to a family of size p, 
m — p are whom mutate, leaving a family of size 
m. The probability P(i m) of £ individuals to 
give birth to m children is derived in term of the 
fundamental distribution of the number of children 
of a single individual, P(m), whose mean we label 
A. All mutations become new families of size one. 

Before we present the differential equation, we 
can verify that asymptotically for large m (item 1 
above), the stable distribution falls like a power. 



(16) 



This can be accomplished by using the central limit 
theorem and evaluating the sums via the Laplace 
method. The details are this calculation are pre- 
sented in Appendix [X] We find that indeed ^ 
where 

lnA2(l~/i) 
lnA(l -/i) 

This is difi^erent from the MZ result, and reflects 
the difference between the overlapping generations 
in their model versus the synchronized update of 
this model. Nevertheless, writing the growth factor 
A = 1 + 7 and going to the slow growth, small 
mutation limit 7 ^ /i <C 1, we get to leading order 



7-pi 



(17) 



This is the MZ result, indicating that for overlap- 
ping generations the growth rate is always small in 
some sense. In any case, we conclude that when 
the growth and mutation are sufficiently small so 
that the population changes slowly, the details of 
the update procedure no longer matter. 

By using the generating function for the child 
distribution, we can derive, as detailed in Appendix 
[B| the Fokker-Planck equation (item 2): 

This equation is approximate. In particular, the 
coefficients are presented, in light of the discussion 
above, to leading order in and 7. In addition, we 
have truncated the equation at terms of second or- 
der. There are additional second derivative terms, 
which arise from the third "spatial" derivative of 
xn{x), which we have as dropped, as do MZ. The 
first order derivative represents the drift to larger 
population, with an effective growth rate for the 
family of 7 — /i, since mutations reduce the family 
size. The coefficient of the second derivative term, 
cr^, is the variance of the given children distribution 
of an individual. It is eminently reasonable that the 
variance in the number of children is what gives 
rise to diffusive behavior of the family size. For the 
case of the geometric distribution of children, which 
characterizes the MZ model, with variance 2 in the 
small 7 limit, the equation reduces to theirs. 

Thus, up to the change of the diffusion con- 
stant, the stable distribution in this approximation 
is again the Kummer function [1 



n{x) 



A, 



2(7 -m) 

t^, 0, 5 X 



(19) 



where A is a normalization factor. From here, we 
can justify the neglect of the higher derivatives. 
The argument of the Kummer function is propor- 
tional to the small quantity 7 — /i, so that higher 
derivatives bring down additional factors of this 
small quantity and are thus indeed negligible. This 
stems from the fact that the drift term is small, 
and again hearkens back to the necessity of assum- 
ing slow growth and small mutation probability. 

The normalization factor can be obtained from 
the size of the entire population, which we denote 
No, since 



No 



m> 1 



xn{x)dx 



(20) 



This integral can be performed analytically [see [Tj 
Eq. 7.612.2], yielding (item 3) 



n{x) 



T{2 + v) U + 



27 



(21) 

It should be noted that this normalization is differ- 
ent from that of MZ. The fact that the distribution 
is proportional to ^ flows from the fact that for 
fjL = Q the distribution goes like l/x"^ and so the 
integral diverges as l//i. 

The form of the distribution means that cr^n is a 
function of the scaled family size m/cr^, for all off- 
spring distributions. This collapse is shown in Fig. 
|3] where the data for the Poisson and geometric 
offspring distributions is plotted, together with the 
analytic prediction. We see the analytic prediction 
is excellent except for the singletons, the families of 
size one. In any case, these are not expected to be 
given by the formula, since the governing equation 
for the singletons is exceptional. 

From the full population distribution, it is 
straightforward to generate, at least formally, the 



family size distribution, 
pie of size Rq: 



for the "red" subsam- 



E Vm) {r, 
Tat; 



(22) 



which reffects the hypergeometric probability of 
choosing m red members of an original family of 
size p, when choosing Ro from No- When Ro <^ No, 
the hypergeometric distribution reduces to a Pois- 
son distribution 



-pB.„/Na 



pRo 
No 



(23) 



We can replace the sum by an integral, yielding [see 
IZl Eq. 7.621.6], the result quoted in the 4th item 
above. 



Ro 



AC/ 1 + J/, 



27 



1 

ml 



« lyR^ B{2 + i^, to) s" X 

2F1 (to, TO + 1; 2 + + m; 1 - s) . (24) 

In Fig. [2]above, we have included together with the 
simulational results for the family size distribution 
for various size subsamples the predictions of our 



formula Eq. ( 24 1 . We see that for all but the largest 



families in the smallest subsample, there is excellent 
agreement, even for a subsample which is 1/20000 
of the whole. 

The first thing we can verify with our analytic 
formula is that the large to behavior of the dis- 
tribution is unchanged, except for normalization. 
For large m, the sum is dominated by p's of or- 
der mNo/Ro- Expanding the summand around this 
maximum yields a Gaussian, and we see that power- 
law is preserved. 

We now investigate our result in the two limits 
of strong and weak sampling, defined by s ^ 1 and 
s ^ 1 , respectively. In the limit of strong sampling 
we return to our fundamental expression for n. 



R 

m 7 



Eq. ( 23 1 . As we shall see momentarily, the typical 
scale of family size over which rim varies is large, 
of order s. Thus, the Poisson sum is essentially a 
Gaussian centered at i* — niNo/Ro, of width ^/i*. 
Since n„i itself varies over the scale I/7, which is 
much larger, to leading order the sum over £ reduces 
to a (5-function, and we have 



vR, 



T{2 + iy) 



U{1 + iy,0,m/s) 



(25) 



Alternatively, we can obtain this expression directly 
from our general formula for using the integral 
representation of 2F1 and taking the large s limit. 
The first derivation shows that the result is valid 
even for Ro ^ No, beyond the range of the Poisson 
sampling approximation, since the argument gener- 
alizes to the original hypergeometric sampling for- 
mula, Eq. ( 22 ) . This is evident from the fact that 



we recover the full population distribution if we sim- 
ply set Ro = No- However, the partial sample dis- 
tribution formula is reliable for all to, whereas for 
the full population, the formula does not apply to 
to's of order unity. As we noted above, this form 



for implies that as we vary Ro, the whole dis- 
tribution moves rigidly leftward and downward. In 
particular, for m s, U approaches a constant 
value, l/r(2 + i^), and 



i^Rc 

TO 



(26) 



Clearly for to 3> s, we recover the MZ power-law, 
as expected. We refer the reader back to Fig. |4] 
for a graphical presentation of the strong sampling 
regime. 

Since the shoulder of the distribution extends to 
to's of order s, clearly the shoulder vanishes by the 
time s reaches 1. We call this "critical" sampling, 
and in this case the distribution is given by 



vRcB{2 + iy,m) 



(27) 



In particular, for small v, this reduces to the simple 
form 



m{m + 1) 



(28) 



and show that in this case the distribution still ap- 
proaches the large to power law from below, but the 
power-law regime sets in already for to > 5 or so. 

We now turn to the weak sampling regime, s <C 1. 
Here, using the transformation formula [T] 



iFi{a,b; c;z) 



T{c)T{c- a-b) 
T{c-a)T{c-h) 



2Fi{a, b;a + b-c+l;l — z) 



, T{c)T{a + b-c) 
2Fi(c ~ a, c — b; c — a — b + I; I 



we note that for m > 1, the second term dominates 
for small s. Thus, we get to leading order 



(29) 



B{2+iy,m-l~v)iyRos'' 



m > 1 (30) 



so that the to > 1 distribution is independent of 
Ro, up to normalization. We see that as s — > 0, 
Tim/Ro vanishes for to > 1, since extremely dilute 
sampling will never encounter two individuals from 
the same family. Again, the small ly limit is partic- 
ularly simple 



vRo 
m(m — 1) 



TO > 1 



(31) 



We see clearly that the case to = 1 is exceptional, 
and requires a separate treatment. We also see 



that the distribution now approaches the asymp- 
totic power-law from above. The weak samphng 
regime is illustrated in Fig. |5] above. 

We have seen that the case of the singleton fam- 
ihes, m = 1, requires special attention, particularly 
for weak sampling. Using the transformation for- 
mula above, we can verify that in the weak sam- 
pling limit, s ^ 0, nf approaches Ro, consistent 
with the vanishing of the larger size families in this 
limit. However, this is misleading for small fj,. In 
this limit, the hypergeometric function can be ex- 
pressed in terms of elementary functions, and 



-RqV 
(1^ 



(l-s + lns) + C'(iy2) 



(32) 



Thus, for dilute sampling, nf / Ro is seen to be ap- 
proximately the small quantity —v\ii{s), The log- 
arithmic behavior is again a reflection of the slow 
1 /x^ decay of the whole population family distri- 
bution in the small /i limit. Only for ridiculously 
dilute samples, with Ro/Nq of order ^e^'^^^, does 
the whole subsample reduce to singletons. This 
is what we see in the graphs of the simulations 
in Fig. [2] where even for the smallest subsample 
shown, with s = 0.0222, there are only 73 single- 
tons in a population of 200. In the strong sam- 



pling regime, as can be seen from Eq. (26 1, we get 
nf « vRc, so it is independent of Ro- This con- 
vergence is also apparent in the data. Overall, the 
fraction of singletons decreases with the strength of 
the sampling. 

The last quantity of interest is the total number 
of red families, which we denote by . In principle 
we can calculate it as the sum over m red family size 
distribution n^, but it is easier to go back to the 
definition of in terms of Up and sum over m 
first, leaving the sum over p for last. The sum over 
m, if it started at m = would just give unity, but 
because it starts at m = 1, yields, in the Poisson 
approximation Ro <^ No'. 



F' 



p>i 







(33) 



Thus, all families in the full population give a fam- 
ily of some size in the red population, unless they 
are not picked at all, which occurs with probability 
exp{—pRo/No), where p is the size of the originat- 
ing family. Plugging in our expression for Up, and 



converting the sum to an integral yields 



F' 



A. 



U{ 1 + 1^,0, 
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dx 



(34a) 



^Rc 
{2 + v) 



jFi(l,l;3 + zy;l) 



(l-s)2i^i(l,l;3-|-i/;l-s) 

(34b) 

The details of this calculation are presented in Ap- 
pendix [C] Again, formally, for any finite we get 
that _F ^approaches _Ro as s ^ 0, as all families 
are singletons. Nevertheless, for small, but not ab- 
surdly small, V, the number of families reads 



F' 



-vRr 



; In i 



1 



(35) 

Ro/F^ is given 



Thus, the average family size, m : 
for small by 

m = -^-^ (36) 
vm s 

which is large but decreases logarithmically at small 
sampling, cutting off at one for extremely small 
samples, density s. It is interesting that whereas 
both the number of red families and the number of 
red singletons behave anomalously for small sam- 
ples and small fi, the fraction of red families that 
are singletons is smooth, approaching unity as it 
should. 

At critical sampling, F^ — i'Rc/{l + i'), and m = 
(1 -|- i^)/i^, which is large for small v. For large s, 
and general v, the number of families is dominated 
by the contribution from the small families, which 
behaves as l/m and so is logarithmically large: 



i/Rr In ( 



where the 0(1) constant a is given by 



Ina = Tpil) - ip{2 + ly) + 



1 



(37) 



(38) 



and Tp is the digamma function, so that a ap- 
proaches 1 for small i'. Thus the average family 
size diverges for large s, 

(39) 



J/ In as 

in agreement with what we found for small v. 



4. Red Statistics Through Time 

4-1- Red Population 

We now present an alternative derivation of these 
results (at least in the small ;y limit), based on trac- 
ing the time development of the red genealogy. We 
have labeled an individual "red" if he is in the se- 
lected sample of the final population. We also label 
as red any ancestor of such an individual. We first 
address the question of the time development of the 
red population. The basic tool for the analysis, as 
with the original Galton- Watson work, is the gener- 
ating function for the distribution of children, which 
we denote by 



F(x) = ^P(n)a 



(40) 



rj=0 



The key is the Galton- Watson observation that the 
generating function for the probability of descen- 
dants in the second generation, F2 (x) is the second 
iterate of F; i.e., F{F{x)). We can see this noting 
that 



P2{n)=y^P{i)P\n) 



i=0 



(41) 



so that 



F-,{x) = ^P(i)P(i-n)a;" = ^P(i)[F(x)]^ 

i,n i 

= nn^)) (42) 

Generalizing, the generating function for the 
probabihty of descendants in the n'th generation, 

Fn{x) = F{Fn-l{x)). 

We now need to include the effects of sampling. 
We ask what the probability Qn that a person n 
generations before the end has no red descendent 
in the final sample. This is just the sum of the 
probabilities that it had k descendants, and that 
none of these were picked to be in the sample: 



Qn = $]^n(fc) (1 
fe=0 



Ro 
No 



F„, 



Ro 
No 



(43) 

where the final population is No and the sample 
size is Rg. The red population n generations before 
the end is then 



Rn = iVoA-"(l-Q„) = No\-" 1 - F„ 1 - 



Ro 
N~o 
(44) 



This is the exact answer. The function F„ is 

what appears in the Galton- Watson theory, and 
Fn{x) approaches the Galton- Watson-Steffensen 
fixed point (which exists for all A > 1) for all x < 1. 
This implies there is an interesting change of be- 
havior of Rn depending on whether 1 — Ro/Nq is 
greater, smaller or equal to the fixed point value. At 
the fixed point, the percentage of reds in the general 
population is constant as a function of time. For a 
smaller sample, the ratio increases as we move back 
in time, starting from the small initial value. For 
a larger sample, the situation is reversed, and the 
ratio decreases to the asymptotic Galton- Watson 
survival probability as we go back in time. 

In order to get a more useful expression, we will 
specialize to our hmit of A close to 1, i.e., 7 <C 1. We 
first calculate the Galton- Watson (GW) fixed point 
in this limit. Since for zero growth, the GW fixed 
point is unity, it is close to this for 7 small. Writing 
Qoo = l — S, the fixed-point equation Qoo = F{Qoo) 
reads 



1-S = F{l-d) « [0-2 - A(l - A)] (45) 

where as before a is the variance of the children 
distribution, so that 



27 
^2 



(46) 



We thus see the origin of the "critical" value 
of sampling we encountered in the previous sec- 
tion, namely the change in behavior depending on 
whether Ro/No is less than or greater than 6 « 
27/cr^; i.e., on whether Ro is less than or greater 
than 2jNo/a'^, which is Rc to leading order in u. 
For the remainder of this section, will in fact 
rewrite 2^No/cr'^ = Rc, as we are working only to 
leading order in v. 

As long as x is close to one, F{x) will similarly be 
close to one. Thus, for Ro <C No, 1 — Ro/No meets 
this criterion and we can approximate the change 
in 5n = Rn/Nn = 1 — Qn, the fraction of reds in 
the population. 



Un+l — 



1-F{1- 6n) ^XSn- f {a^ - A(l 



(1 + l)S„ 



5W 



so that 
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(48) 
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Figure 7: The number of red individuals as a function of the 
number of generations in the past, R(n), for a single run, 
with A^o = 4 ■ 10'', 7 = 5- IQ-^, fj. = 5 ■ 10"''. The sample 
sizes are i?o = 2000, 4- 10'', and 8- 10^, which are subcritical, 
critical and supercritical respectively. Also shown as solid 
lines are the predictions of Eq. JSOll . 



with the solution 



Sn 



Ro<J^ 



2-fRo 

{2jNo ~ i?oCr2)e-T" 



SO that 



Roe-<'' + (i?e - Ro) 



Ro 



(49) 



(50) 



The change in behavior as R^, crosses Rc is appar- 
ent. In particular, at critical sampling, R(n) is 
a pure exponential in time. We also see that i?„ 
depends on the underlying distribution of children 
only through its average, i.e. 7, and its variance. 

In Fig. [7] we show data for the R{n) from a single 
simulation, for three different sampling strengths, 
together with our analytic formula, Eq. ( 50 1 . The 



data exhibit the striking change in behavior from 
a fast initial decrease in red individuals for strong 
sampling, a pure exponential for critical sampling 
and a slow initial decrease for weak sampling. All 
three curves merge in the past, at the coalescence 
time for the entire population. 

An alternate, equivalent way to arrive at our 
result is to consider the coalescence of branches 
on the red tree. If we look at the i?„ reds in 
the nth previous generation, these are generated 
by slightly less than i?„ parents, due to coales- 
cence. The chances of coalescence, per potential 



parent, is the chance of having two surviving chil- 
dren, ^P(n)^^^^(52 ^ ^2^2/2. Thus, the de- 
crease in the number of reds is a^R^/2N so that 
the equation for R is 



dR 

dn 



2N 



(51) 



which is equivalent to our above equation for 5. 

4-2. Red Families 

We now turn to examine the time development 
of the number of red families, F^{n), n steps in the 
past . As in the absence of mutation every family 
survives, since it is red, the only change in families 
from n steps in the past to n — 1 steps in the past 
is the new families due to mutation. As mutations 
of singletons do not create new families, we have 

F^{n - 1) = F^in) + ^i{R{n) - nf (n)) . (52) 

In the small /x limit, the number of singletons is 
small, proportional to /i. Thus to leading order in 
/Lt, we have 

F^{n - 1) = F^{n) + iiR{n) (53) 
or, in its differential equation form, 
dF" 



dn 



-^iR 



(54) 



Using our solution, Eq. (50 1, for R{n), we obtain 

R. 



F'^n) = 



Rr — Rn 



Ro 



-1 e 



(55) 



where we have demanded that F^ — > as n — > cxd. 
In particular, at the sampling time. 



F^(0) 



vRcRo , Rc 
In ■ 



Rc — Ro Ro 



(56) 



which of course agrees with the small /i limit re- 
sult Eq. ([36]) we derived using the Fokker-Planck 
approach. 

In Figure m we show data for the number of red 
families going backward in time, again for subcriti- 
cal, critical and supercritical sampling. The small 1/ 



result, Eq. (551, is also shown. The small deviation 



is due to higher order corrections in v. 
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Figure 8: The number of red families as a function of the 
number of generations in the past, F^{n), for a single run, 
with Afo = 4 ■ 10'', 7 = 5- IQ-^, fj. = 5 ■ lO"". The sample 
sizes are Ro = 2000, 4- 10'', and 8- 10^, which are subcritical, 
critical and supercritical respectively. Also shown as solid 
lines are the small u approximations from Eq. (|55|. 



Figure 9: The number of red singletons as a function of the 
number of generations in the past, n^{n), for a single run, 
with No = 4- 10^, 7 = 5- 10-^, /i = 5 ■ lO"-*. The sample 
sizes are Ro = 2000, 4- 10*, and 8- 10^, which are subcritical, 
critical and supercritical respectively. Also shown as solid 
lines are the small u approximations from Eq. ||58|l. 



4-.3. Red Singletons 

We now examine the time development of the 
number of red singletons. The number of single- 
tons decreases due to the fact that some singletons 
give birth to multiple children, who are no longer 
singletons. It increases due to the fact that mu- 
tations of non-singletons give rise to new single- 
tons. The latter factor is very simple: — nf^), 
just as with red families. As discussed above, the 
decreases in number of reds due to coalescence is 
a^R{n)'^ /2N{n). Of these a fraction nf^/R are sin- 
gletons, so the loss of singletons due to coalescence 
of singletons is a Rn^/2N. Thus the differential 
equation for rii is 



dn 



^Rnf 



M(i?-nf) 



(57) 



Again we can drop the ^nf^ term as being higher 
order in /x. Then the solution that vanishes as n ^ 
oo is 



.f(t) 



vRcRo 



{Rc 
(Roe-''' 



Roy 



{Rc — Ro 



(Rc — Ro) 



In 1 



- 1 



(58) 



In particular, at the sampling time the number of 
red singletons is given by 



nf (0) = 



vRcRo 
{Rc - RoY 



Rrll-l 



Rc 
Ro 



{Rc — Ro) 



(59) 

This again can be seen to reproduce our Fokker- 
Planck result, Eq. (32 1. It is interesting to note 
that at critical sampling nf approaches vRcl2, 
whereas approaches vRc, so that half the fami- 
lies are singletons in this case. The simulation data 
for nf{n) is presented in Fig. [oj together with the 
small v prediction, Eq. ( [SS] ). The picture is quali- 
tatively similar to that of the red families. 

5. The U.S. Census Data 

We now attempt to apply the theory to the sur- 
name distribution taken from the 2000 U.S. Cen- 
su^ We must make clear at the outset that degree 
of overlap between the assumptions underlying our 
theoretical treatment and the real dynamics of the 



'Available at http://www.census.gov/genealogy/ww w/freqnames2k.htmll 
This data only extends to surnames with 
more than 100 representatives. Data for rarer 

surnames is available in binned form from 
http: / / www . census . gov/ genealogy / www / surnames . pdf| 



The binned 
procedure 



data was then debinned using a smoothing 
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Figure 10: The surname distribution taken from the 2000 
U.S. census. Also show is a curve outlining a 1/ m? power-law 
decay, and the theoretical curve, Eq. |7| for A^o = 2.8 ■ 10*, 
7 = 5- lO--*, and ii = 10"". 



surname distribution may rightfully be questioned. 
Nevertheless, we will proceed and see how far we 
can go. 

The surname data is presented in Fig. [TO] We see 
that its overall shape is similar to that of the the- 
ory. In particular, the graph exhibits a 1 /iv? fallofF 
for large family sizes, in accord with our expecta- 
tion. However, upon closer examination, the data 
presents us with a severe problem. The asymptotic 
power law only sets in for family sizes above 10* or 
so. According to the theory, the onset of the power 
law should occur roughly at m ^ 10/(27). Thus, 
the data would point to a value of 7 of roughly 
5 • 10~*. However, this is off from the true growth 
rate of the population by a factor of 1000! 

The growth rate of the U.S. population has not 
been constant in time. However, it was quite con- 
stant up until the severe immigration restrictions 
were applied in the wake of World War I, as can be 
with a value of 0.0255/year. As- 
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seen from Fig. 
suming a generational time of 20 years, this works 
out to give a value of 7 = 0.51. Even taking the 
post-WWI growth rate, the growth rate is only re- 
duced to 7 = 0.3, so we still have a factor of 500 
to account for. Even if we compare the data to the 
theoretical curve with 7 = 5- 10"''^, so as to repro- 
duce the break from the power-law at m — 10*, 
the agreement only extends down to family sizes of 
m = 500. 

One might think that the problem is that the 
U.S. is not a demographically closed system. The 



Figure 11: The population of the U.S. as a function of time, 
together with an exponential fit to the period prior to 1920 



large impact of immigration in driving the popula- 
tion growth before WWI is clear evidence of this. 
One might then want to claim that the U.S. popu- 
lation should be considered as a (biased) sample of 
the European (or perhaps, world) population. The 
growth rate of the European and world population 
before 1920 is roughly a factor of 5 smaller than 
the U.S. growth rate. However, this improvement is 
more than counterbalanced by the sampling effect, 
which is to move the distribution leftward by the 
sampling factor, thereby moving the onset of the 
asymptotic 1/m^ scaling to even smaller m. Thus, 
the solution cannot lie in this direction. 

In fact, no clear answer presents itself at this 
point, and we must leave this puzzle as a challenge 
for the future. One possibility might be that the 
"mutation" of family names, at least for the U.S., 
is likely very different from that assumed by the 
model. There was a great tendency in the period 
of the great immigration for family name changes 
not to create new surnames, as we assume, but in 
fact the opposite. The immigration officials were 
(in)famous for mapping the wide spectrum of sur- 
names they encountered onto recognizable "Amer- 
ican" names. Further work will be needed to see 
what effect this might have had on the surname 
distribution. 

In summary, then, we have calculated the fam- 
ily size distribution for a growing population. We 
have shown that the distribution is universal for 
slow growth rate. In addition, we have calculated 
the distribution for arbitrarily sized subsamples of 
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the population. We found a distinction between 
strong sampling, where the shoulder at small fam- 
ily sizes, while truncated still exists, and weak sam- 
pling, where the small family distribution lies above 
the power-law. The critical sampling dividing these 
two regimes is Rc = 2^No/ (^"^{1 + v)- In the strong 
sampling regime, the distribution is rigidly trans- 
lated (in log space) through a rescaling of m. For 
weak sampling, the distribution is independent of 
sampling, up to overall normalization. We have also 
seen that the distinction between weak and strong 
sampling holds for the properties of the genealogical 
tree constructed from the sampled individuals. 

This work was supported by the EU 6th frame- 
work COS pathfinder. Y.M. acknowledge the finan- 
cial support of the Israeli Center for Complexity 
Science. 

A. Derivation of the Power-Law 

For large m, the major contribution of the sum 
over p is from the vicinity of p.^ = m/(l — /i) and 
for the sum over £ from the vicinity of = /A = 
m*/((l — /x)A). Thus, we can, invoking the central 
limit theorem, approximate the distribution P(£ 
p) by 

P{l^p)^^=.e-^^ (60) 

Similarly, we can approximate the binomial distri- 
bution for the number of mutations by 

P \ 1 („-.p(l-^))2 

^ (1 - /x)"V^'-™ « , , e- ^M^-^)^ 

mj V27r^(l - ^i)p 

(61) 

Replacing the sums over I and p by integrals and 
expanding the exponent to second order yields 



Mdp n\ 



1 



exp 



27rcr-\/^p/i(l — p) 
{p-Mf {m,-p{\-^i)f 
2cr2^ 2^(1 - /i)p 



^^^^"#n* lldtdpe^^''^^ (62) 



where 

T{i,p) = - 



2'Ki7m^pL 

A3(l-^)(£-m,)2 



2cr^m 

{l-^Ji){\^l + a\l-^l)){p-p,f 

2fim 

XHl-ti)i£~£,){p-p,) 



Substituting = Am ^X* and doing the Gaus- 
sian integrals gives 

-0 

(64) 



1 



A(l-Ai) VA(l-/i). 

Equivalently, we could simply replace the Gaussians 
by the i5-functions S{p — m£)S{m — (1 — /i)p) and 
integrate over £ and p successively. Taking the log- 



arithm gives us our equation for /3, Eq. (16 1 



B. The Fokker-Planck Equation 



We start with Eq. (15a I, and write P{£ p) in 

Q, F: 



terms of the generating function, F 

dz 



P{£ -^p) = 



(65) 



givmg 



i>0 
p'>m 



(66) 



We next expand rq in a Taylor series around n* 



dn 



1 



nl-nl + -i£-m) + -—^i£-mf + ... (67) 

We can now do the geometric sums over £ and the 
sum over p using 



E 

p'>m 



(1 - x)"+i 



(68) 



to get 



dz (1 - At)™ 
27ri (z - /i)™+i 



n — n'm + n"m^/2 
1 - F{^) 



(n' - mn")F{z) {n" /2)F{z){l + F{z)) 



(1-F(z))2 
n — n'm + rt"m^/2 



(1-F(z))3 



a(i-m) 

(n' - mn") [(A^ - /2)(1 - /i) + (m + 1)A] 



n"/2 



A3(1-m)2 
-6A(1-m)'/3 



A5(l-/.)3^ 

+ 12(1 - ^) VI - 6A(1 - pfXh 
+ 6A(1 - fi){m + l)/2 + A4(1 - ^if 

- 3X^(1 - ^)(to + 1) + (to -t- 2)(to + 1)A2 



(63) 
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(69) 



where we have set the contour between the pole at 
z = jjL and that at z = 1 and picked up the residue 
at the outer pole at z = 1, and the expansion of 
F{z) near z = 1 is 

F{z) « l+\{z~l)+j2{z-lf+h{z-lf+. . . (70) 

It is possible to translate this difference equation to 
a differential equation only in the limit 7 ~ /x <C 
1. Then, expanding the coefficients of the various 
derivatives of n to leading order, things simplify to 

h{m) = — (7 — fJ.)n+ [—(7 — + 2/2] ?i' + f2mn" 

(71) 

Using that, to leading order in 7, /2 — gives 
us Eq. plf. 



C. The Number of Red Families 



The number of red families is, from Eq. (|34al 

-pi?,„/Af„ 



1 



dx 

^-[/( 1 + 1^,0, 



27 



(72) 



We cannot break up the term in the parenthesis 
as is, since both integrals would then diverge. We 
therefore regularize the problem: 



F' 



(1 - e-") x'^-'^U (1 + V, 0, x) dx [14 
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AlimI(^HXi±l), 
€^qT{2 + v + e) 
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